Customer classification is the practice of dividing a customer base into groups of individuals that are similar in ways relevant to marketing, such as age, gender, interests, and spending habits.
Companies using customer classification operate under the assumption that every customer is different, and that marketing efforts are better served by targeting specific, smaller groups with messages those consumers find relevant enough to lead them to a purchase. Companies also hope to gain a deeper understanding of their customers' preferences and needs, discovering what each class values most so that marketing materials can be tailored more accurately toward that class. In this lab, logistic regression is used to predict a continuous score (a probability), and a threshold then converts that score into a class label for the classification task.
In this lab, customer classification is performed on an automobile customer segmentation dataset:
URL:https://www.kaggle.com/datasets/abisheksudarshan/customer-segmentation?select=train.csv
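As a reminder of the mechanics described above: logistic regression produces a continuous score via the sigmoid function, and a threshold converts that score into a class label. A minimal sketch with toy weights (purely illustrative, not fit to this dataset):

```python
import numpy as np

def sigmoid(z):
    # squashes any real-valued score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and two feature vectors for illustration only
w, b = np.array([0.8, -0.4]), 0.1
X_toy = np.array([[1.0, 2.0], [3.0, 0.5]])

probs = sigmoid(X_toy @ w + b)        # continuous values in (0, 1)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5 -> binary output
print(probs.round(3), labels)
```

The same idea extends to the multi-class segmentation task later in the lab, where one score is produced per class.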
Loading the data and the necessary library packages
#%pip install hdbscan
import pandas as pd
import numpy as np
import missingno as mn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA as sklearnPCA
import seaborn as sns
from mpl_toolkits import mplot3d
import matplotlib.ticker as mtick
from sklearn import preprocessing
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from scipy.special import expit
import copy
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
import warnings
from plotly.graph_objs import Bar, Line
import plotly
from plotly.graph_objs import Scatter, Layout
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from plotly.graph_objs.scatter import Marker
from plotly.graph_objs.layout import XAxis, YAxis
from scipy.optimize import fmin_bfgs # maybe the most common bfgs algorithm in the world
from numpy import ma
from sklearn.linear_model import LogisticRegression
warnings.filterwarnings("ignore", category=FutureWarning)
df = pd.read_csv('C://Users/Hedieh/Documents/SMU/ML-Py/lab3/train.csv')
df.head()
| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
Defining our attributes, their types, and descriptions
| Attribute | Attr. type | Description |
|---|---|---|
| ID | Nominal | Identification number of the individual surveyed |
| Gender | Nominal | Sex of the individual surveyed: Male or Female |
| Ever_Married | Nominal | Whether the individual reported being previously (or currently) married |
| Age | Ratio | Age of the individual surveyed |
| Graduated | Nominal | Whether the individual reported having graduated |
| Profession | Nominal | The individual's reported profession |
| Work_Experience | Ratio | Reported years of work experience |
| Spending Score | Ordinal | Reported spending capacity: Low, Average, or High |
| Family_Size | Ratio | The individual's reported family size |
| Var_1 | Nominal | Anonymised category for the customer |
| Segmentation | Nominal | Our classification target labels (A, B, C, D) |
df.describe()
| | ID | Age | Work_Experience | Family_Size |
|---|---|---|---|---|
| count | 8068.000000 | 8068.000000 | 7239.000000 | 7733.000000 |
| mean | 463479.214551 | 43.466906 | 2.641663 | 2.850123 |
| std | 2595.381232 | 16.711696 | 3.406763 | 1.531413 |
| min | 458982.000000 | 18.000000 | 0.000000 | 1.000000 |
| 25% | 461240.750000 | 30.000000 | 0.000000 | 2.000000 |
| 50% | 463472.500000 | 40.000000 | 1.000000 | 3.000000 |
| 75% | 465744.250000 | 53.000000 | 4.000000 | 4.000000 |
| max | 467974.000000 | 89.000000 | 14.000000 | 9.000000 |
Checking for duplicate values and printing the value types and counts
# compare the length of list and set to check if the dataset contains duplicates
if len(set(df.ID)) != len(df.ID):
print("duplicates found in the list")
else:
print("No duplicates found in the list")
categorical_features_column=['Gender','Ever_Married','Age','Graduated','Profession','Work_Experience','Spending_Score','Family_Size','Var_1','Segmentation']
for col in categorical_features_column:
print(df[col].value_counts())
No duplicates found in the list
Male 4417
Female 3651
Name: Gender, dtype: int64
Yes 4643
No 3285
Name: Ever_Married, dtype: int64
35 250
37 234
33 232
42 232
40 229
...
78 29
87 28
76 27
80 24
85 22
Name: Age, Length: 67, dtype: int64
Yes 4968
No 3022
Name: Graduated, dtype: int64
Artist 2516
Healthcare 1332
Entertainment 949
Engineer 699
Doctor 688
Lawyer 623
Executive 599
Marketing 292
Homemaker 246
Name: Profession, dtype: int64
1.0 2354
0.0 2318
9.0 474
8.0 463
2.0 286
3.0 255
4.0 253
6.0 204
7.0 196
5.0 194
10.0 53
11.0 50
12.0 48
13.0 46
14.0 45
Name: Work_Experience, dtype: int64
Low 4878
Average 1974
High 1216
Name: Spending_Score, dtype: int64
2.0 2390
3.0 1497
1.0 1453
4.0 1379
5.0 612
6.0 212
7.0 96
8.0 50
9.0 44
Name: Family_Size, dtype: int64
Cat_6 5238
Cat_4 1089
Cat_3 822
Cat_2 422
Cat_7 203
Cat_1 133
Cat_5 85
Name: Var_1, dtype: int64
D 2268
A 1972
C 1970
B 1858
Name: Segmentation, dtype: int64
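As a side note, the set-versus-list comparison used above can also be written with pandas' built-in `duplicated`; a quick sketch on a toy frame (the IDs are made-up values for illustration):

```python
import pandas as pd

# Toy frame standing in for the lab's dataframe
demo = pd.DataFrame({"ID": [462809, 462643, 466315, 462643]})

# Equivalent to comparing len(set(df.ID)) with len(df.ID)
has_dupes = demo["ID"].duplicated().any()
print("duplicates found" if has_dupes else "No duplicates found")
```

`demo.duplicated().any()` would check for fully duplicated rows rather than repeated IDs only.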
================================================
The ID column is removed because it does not contain any useful information. The dataset reference does not explain the Var_1 column or what its categories mean, so to avoid confusion I removed it as well.
del df['ID']
del df['Var_1']
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Gender           8068 non-null   object
 1   Ever_Married     7928 non-null   object
 2   Age              8068 non-null   int64
 3   Graduated        7990 non-null   object
 4   Profession       7944 non-null   object
 5   Work_Experience  7239 non-null   float64
 6   Spending_Score   8068 non-null   object
 7   Family_Size      7733 non-null   float64
 8   Segmentation     8068 non-null   object
dtypes: float64(2), int64(1), object(6)
memory usage: 567.4+ KB
None
mn.matrix(df)
plt.title("Not Sorted",fontsize=22)
plt.figure()
mn.matrix(df.sort_values(by=["Work_Experience","Ever_Married"]))
plt.title("Sorted",fontsize=22)
plt.show()
print(df['Ever_Married'].isna().sum(), "values out of 8068 samples are missing in the Ever_Married column")
print(df['Graduated'].isna().sum(), "values out of 8068 samples are missing in the Graduated column")
print(df['Profession'].isna().sum(), "values out of 8068 samples are missing in the Profession column")
print(df['Work_Experience'].isna().sum(), "values out of 8068 samples are missing in the Work_Experience column")
print(df['Family_Size'].isna().sum(), "values out of 8068 samples are missing in the Family_Size column")
140 values out of 8068 samples are missing in the Ever_Married column
78 values out of 8068 samples are missing in the Graduated column
124 values out of 8068 samples are missing in the Profession column
829 values out of 8068 samples are missing in the Work_Experience column
335 values out of 8068 samples are missing in the Family_Size column
===================================================
Based on the dataframe information, the columns above contain missing values.
These gaps may have occurred because some customers did not want to share their information with the company.
Since the purpose of this survey is to find which car category, in a specific price range, is suitable for each customer, I believe work experience and profession are the most significant factors for classification because they have a direct impact on a customer's financial situation. Therefore I decided to drop the rows with blanks in these two columns and impute values for the other columns' missing entries. To impute the categorical features they first need a numeric representation; one-hot encoding is a common technique for this, converting each categorical variable into a set of binary indicator columns and thereby avoiding an artificial ordering that could bias a predictive model. We also need to normalize our numerical data to a common scale, so that different features can be compared on a level playing field.
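As a quick illustration of one-hot encoding with pandas' `get_dummies` (a toy column reusing the lab's Spending_Score categories; note the encoding cells further below use integer codes instead):

```python
import pandas as pd

# Toy column with the same Spending_Score categories used in this lab
toy = pd.DataFrame({"Spending_Score": ["Low", "Average", "High", "Low"]})

# One-hot encoding: each category becomes its own 0/1 indicator column,
# which avoids implying an artificial order between the categories
encoded = pd.get_dummies(toy, columns=["Spending_Score"])
print(encoded.columns.tolist())
```

The generated columns are ordered alphabetically by category name.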
# Mark missing Work_Experience and Profession entries as "not reported",
# then drop those rows (equivalent to dropping rows that are missing either column)
df.fillna({
    "Work_Experience": "not reported", 'Profession': "not reported"
}, inplace=True)
df = df[~df.Work_Experience.isin(['not reported'])]
df = df[~df.Profession.isin(['not reported'])]
warnings.filterwarnings("ignore", category=FutureWarning)
df
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | D |
| 2 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | B |
| 3 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | B |
| 5 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | C |
| 6 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8062 | Male | Yes | 41 | Yes | Artist | 0.0 | High | 5.0 | B |
| 8064 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | D |
| 8065 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | D |
| 8066 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | B |
| 8067 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | B |
7141 rows × 9 columns
mn.matrix(df)
plt.title("Not Sorted",fontsize=22)
plt.figure()
mn.matrix(df.sort_values(by=["Work_Experience","Ever_Married"]))
plt.title("Sorted",fontsize=22)
plt.show()
warnings.simplefilter('ignore')
# Replace Gender with a numerical indicator: 0 for Male, 1 for Female
df.Gender.replace(to_replace = ['Male', 'Female'], value = range(0,2), inplace = True)
# Replace Ever_Married with a numerical indicator: 0 for No, 1 for Yes
df.Ever_Married.replace(to_replace = ['No', 'Yes'], value = range(0,2), inplace = True)
# Replace Graduated with a numerical indicator: 0 for No, 1 for Yes
df.Graduated.replace(to_replace = ['No', 'Yes'], value = range(0,2), inplace = True)
# Replace Profession with a numerical indicator: 0 Artist, 1 Healthcare, 2 Entertainment,
# 3 Engineer, 4 Doctor, 5 Lawyer, 6 Executive, 7 Marketing, 8 Homemaker
df.Profession.replace(to_replace = ['Artist', 'Healthcare','Entertainment','Engineer','Doctor','Lawyer','Executive','Marketing','Homemaker'], value = range(0,9), inplace = True)
# Replace Spending_Score with a numerical indicator: 0 for Low, 1 for Average, 2 for High
df.Spending_Score.replace(to_replace = ['Low', 'Average','High'], value = range(0,3), inplace = True)
# Replace Segmentation with a numerical indicator: 0 for A, 1 for B, 2 for C, 3 for D
df.Segmentation.replace(to_replace = ['A', 'B','C','D'], value = range(0,4), inplace = True)
df.head()
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 22 | 0.0 | 1 | 1.0 | 0 | 4.0 | 3 |
| 2 | 1 | 1.0 | 67 | 1.0 | 3 | 1.0 | 0 | 1.0 | 1 |
| 3 | 0 | 1.0 | 67 | 1.0 | 5 | 0.0 | 2 | 2.0 | 1 |
| 5 | 0 | 1.0 | 56 | 0.0 | 0 | 0.0 | 1 | 2.0 | 2 |
| 6 | 0 | 0.0 | 32 | 1.0 | 1 | 1.0 | 0 | 3.0 | 2 |
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7141 entries, 0 to 8067
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Gender           7141 non-null   int64
 1   Ever_Married     7026 non-null   float64
 2   Age              7141 non-null   int64
 3   Graduated        7080 non-null   float64
 4   Profession       7141 non-null   int64
 5   Work_Experience  7141 non-null   object
 6   Spending_Score   7141 non-null   int64
 7   Family_Size      6877 non-null   float64
 8   Segmentation     7141 non-null   int64
dtypes: float64(3), int64(5), object(1)
memory usage: 557.9+ KB
None
df.describe()
| | Gender | Ever_Married | Age | Graduated | Profession | Spending_Score | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|
| count | 7141.000000 | 7026.000000 | 7141.000000 | 7080.000000 | 7141.000000 | 7141.000000 | 6877.000000 | 7141.000000 |
| mean | 0.453998 | 0.584401 | 43.484106 | 0.632062 | 2.330066 | 0.541101 | 2.844845 | 1.549783 |
| std | 0.497914 | 0.492860 | 16.598612 | 0.482278 | 2.366995 | 0.738799 | 1.526928 | 1.132011 |
| min | 0.000000 | 0.000000 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 31.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 1.000000 |
| 50% | 0.000000 | 1.000000 | 41.000000 | 1.000000 | 2.000000 | 0.000000 | 3.000000 | 2.000000 |
| 75% | 1.000000 | 1.000000 | 53.000000 | 1.000000 | 4.000000 | 1.000000 | 4.000000 | 3.000000 |
| max | 1.000000 | 1.000000 | 89.000000 | 1.000000 | 8.000000 | 2.000000 | 9.000000 | 3.000000 |
Let's try two methods of imputation on the columns that still contain missing values (Ever_Married, Graduated, and Family_Size) to see which one works better.
I will use median imputation for the missing values, since the median is robust to outliers.
# Split, impute, combine:
# group the rows by Ever_Married and Age, fill missing values within each
# group using the group median, then transform back into a single dataframe
df_grouped = df.groupby(by=['Ever_Married','Age'])
func = lambda grp: grp.fillna(grp.median()) # within groups, fill using the median
numeric_columns = ['Age','Work_Experience','Family_Size'] # only transform numeric columns
df_imputed_sic = df_grouped[numeric_columns].transform(func) # apply the imputation and transform back
# Extra step: fill any object columns that could not be transformed
col_deleted = list( set(df.columns) - set(df_imputed_sic.columns)) # in case the median operation deleted columns
df_imputed_sic[col_deleted] = df[col_deleted]
# drop any rows that still had missing values after grouped imputation
df_imputed_sic.dropna(inplace=True)
# Rearrange the columns
df_imputed_sic = df_imputed_sic[['Gender','Ever_Married','Age','Graduated','Profession','Work_Experience','Spending_Score','Family_Size','Segmentation']]
df_imputed_sic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6967 entries, 0 to 8067
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Gender           6967 non-null   int64
 1   Ever_Married     6967 non-null   float64
 2   Age              6967 non-null   int64
 3   Graduated        6967 non-null   float64
 4   Profession       6967 non-null   int64
 5   Work_Experience  6967 non-null   float64
 6   Spending_Score   6967 non-null   int64
 7   Family_Size      6967 non-null   float64
 8   Segmentation     6967 non-null   int64
dtypes: float64(4), int64(5)
memory usage: 544.3 KB
df['Graduated']
0 0.0
2 1.0
3 1.0
5 0.0
6 1.0
...
8062 1.0
8064 0.0
8065 1.0
8066 1.0
8067 1.0
Name: Graduated, Length: 7141, dtype: float64
# Z-score standardization: (x - mean) / std for each numeric column
col = ['Age','Work_Experience','Family_Size']
df_normalized = pd.DataFrame(columns=['Age','Work_Experience','Family_Size'])
for i in col:
    df_normalized[i] = (df[i] - df[i].mean()) / df[i].std()
df_normalized
| | Age | Work_Experience | Family_Size |
|---|---|---|---|
| 0 | -1.294331 | -0.482522 | 0.756522 |
| 2 | 1.416739 | -0.482522 | -1.208207 |
| 3 | 1.416739 | -0.775847 | -0.553297 |
| 5 | 0.754033 | -0.775847 | -0.553297 |
| 6 | -0.691871 | -0.482522 | 0.101612 |
| ... | ... | ... | ... |
| 8062 | -0.149657 | -0.775847 | 1.411432 |
| 8064 | -0.511133 | 0.104128 | 0.756522 |
| 8065 | -0.631625 | -0.482522 | -1.208207 |
| 8066 | -0.993101 | -0.482522 | 0.756522 |
| 8067 | -0.390641 | -0.775847 | 0.101612 |
7141 rows × 3 columns
header=['Gender','Ever_Married','Age','Graduated','Profession','Work_Experience','Spending_Score','Family_Size','Segmentation']
df_total_after_normalization=pd.DataFrame( columns=header)
for i in header:
if i in df_normalized.columns:
df_total_after_normalization[i]=df_normalized[i]
else:
df_total_after_normalization[i]=df[i]
df_total_after_normalization
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | -1.294331 | 0.0 | 1 | -0.482522 | 0 | 0.756522 | 3 |
| 2 | 1 | 1.0 | 1.416739 | 1.0 | 3 | -0.482522 | 0 | -1.208207 | 1 |
| 3 | 0 | 1.0 | 1.416739 | 1.0 | 5 | -0.775847 | 2 | -0.553297 | 1 |
| 5 | 0 | 1.0 | 0.754033 | 0.0 | 0 | -0.775847 | 1 | -0.553297 | 2 |
| 6 | 0 | 0.0 | -0.691871 | 1.0 | 1 | -0.482522 | 0 | 0.101612 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8062 | 0 | 1.0 | -0.149657 | 1.0 | 0 | -0.775847 | 2 | 1.411432 | 1 |
| 8064 | 0 | 0.0 | -0.511133 | 0.0 | 6 | 0.104128 | 0 | 0.756522 | 3 |
| 8065 | 1 | 0.0 | -0.631625 | 1.0 | 1 | -0.482522 | 0 | -1.208207 | 3 |
| 8066 | 1 | 0.0 | -0.993101 | 1.0 | 1 | -0.482522 | 0 | 0.756522 | 1 |
| 8067 | 0 | 1.0 | -0.390641 | 1.0 | 6 | -0.775847 | 1 | 0.101612 | 1 |
7141 rows × 9 columns
knn_obj = KNNImputer(n_neighbors=3)
features_to_use = ['Gender','Ever_Married','Age','Graduated','Profession','Work_Experience','Spending_Score','Family_Size','Segmentation']
# create a numpy matrix from pandas numeric values to impute
temp = df_total_after_normalization[features_to_use].to_numpy()
# use sklearn imputation object
knn_obj.fit(temp)
temp_imputed = knn_obj.transform(temp)
df_knn_imputed = copy.deepcopy(df_total_after_normalization) # not just an alias
df_knn_imputed[features_to_use] = temp_imputed
df_knn_imputed.dropna(inplace=True)
df_knn_imputed.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7141 entries, 0 to 8067
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Gender           7141 non-null   float64
 1   Ever_Married     7141 non-null   float64
 2   Age              7141 non-null   float64
 3   Graduated        7141 non-null   float64
 4   Profession       7141 non-null   float64
 5   Work_Experience  7141 non-null   float64
 6   Spending_Score   7141 non-null   float64
 7   Family_Size      7141 non-null   float64
 8   Segmentation     7141 non-null   float64
dtypes: float64(9)
memory usage: 815.9 KB
f = plt.figure(figsize=(16,5))
bin_num = 30
plt.subplot(2,2,1)
df_imputed_sic.Family_Size.plot(kind='hist', alpha=0.25, label="Split-Impute-Combine",bins=bin_num)
df.Family_Size.plot(kind='hist', alpha=0.25, label="Original",bins=bin_num)
plt.legend()
plt.ylim()
plt.subplot(2,2,2)
df_knn_imputed.Family_Size.plot(kind='hist', alpha=0.25, label="KNN-Imputer",bins=bin_num)
df_total_after_normalization.Family_Size.plot(kind='hist', alpha=0.25, label="Original",bins=bin_num)
plt.legend()
plt.ylim()
plt.subplot(2,2,3)
df.Family_Size.plot.kde(bw_method=0.2,label='Family_Size_kde',color='red')
df_imputed_sic.Family_Size.plot.kde(bw_method=0.2,label='Family_Size_kde_sic',color='green')
plt.legend(loc='upper right')
plt.subplot(2,2,4)
df_knn_imputed.Family_Size.plot.kde(bw_method=0.2,label='Family_Size_kde_knn',color='blue')
df_total_after_normalization.Family_Size.plot.kde(bw_method=0.2,label='Family_Size_kde',color='red')
plt.suptitle( 'Family_Size Imputation' , fontsize=30)
plt.show()
f = plt.figure(figsize=(16,5))
bin_num = 30
plt.subplot(2,2,1)
df_imputed_sic.Ever_Married.plot(kind='hist', alpha=0.25, label="Split-Impute-Combine",bins=bin_num)
df.Ever_Married.plot(kind='hist', alpha=0.25, label="Original",bins=bin_num)
plt.legend()
plt.ylim()
plt.subplot(2,2,2)
df_knn_imputed.Ever_Married.plot(kind='hist', alpha=0.25, label="KNN-Imputer",bins=bin_num)
df_total_after_normalization.Ever_Married.plot(kind='hist', alpha=0.25, label="Original",bins=bin_num)
plt.legend()
plt.ylim()
plt.subplot(2,2,3)
df.Ever_Married.plot.kde(bw_method=0.2,label='Ever_Married_kde',color='red')
df_imputed_sic.Ever_Married.plot.kde(bw_method=0.2,label='Ever_Married_kde_sic',color='green')
plt.legend(loc='upper right')
plt.subplot(2,2,4)
df_knn_imputed.Ever_Married.plot.kde(bw_method=0.2,label='Ever_Married_kde_knn',color='blue')
df_total_after_normalization.Ever_Married.plot.kde(bw_method=0.2,label='Ever_Married_kde',color='red')
plt.suptitle( 'Ever_Married Imputation' , fontsize=30)
plt.show()
f = plt.figure(figsize=(16,5))
bin_num = 30
plt.subplot(2,2,1)
df_imputed_sic.Graduated.plot(kind='hist', alpha=0.25, label="Split-Impute-Combine",bins=bin_num)
df.Graduated.plot(kind='hist', alpha=0.25, label="Original",bins=bin_num)
plt.legend()
plt.ylim()
plt.subplot(2,2,2)
df_knn_imputed.Graduated.plot(kind='hist', alpha=0.25, label="KNN-Imputer",bins=bin_num)
df_total_after_normalization.Graduated.plot(kind='hist', alpha=0.25, label="Original",bins=bin_num)
plt.legend()
plt.ylim()
plt.subplot(2,2,3)
df.Graduated.plot.kde(bw_method=0.2,label='Graduated_kde',color='red')
df_imputed_sic.Graduated.plot.kde(bw_method=0.2,label='Graduated_kde_sic',color='green')
plt.legend(loc='upper right')
plt.subplot(2,2,4)
df_knn_imputed.Graduated.plot.kde(bw_method=0.2,label='Graduated_kde_knn',color='blue')
df_total_after_normalization.Graduated.plot.kde(bw_method=0.2,label='Graduated_kde',color='red')
plt.suptitle( 'Graduated Imputation' , fontsize=30)
plt.show()
It seems that the split-impute-combine imputer preserves the original distributions better than the KNN imputer.
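To go beyond eyeballing the histograms, a two-sample Kolmogorov-Smirnov test can quantify how closely an imputed column tracks the original distribution. A sketch on synthetic stand-in data (not the lab's actual arrays):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Stand-in for an observed column (mean and spread loosely mimic Family_Size)
original = rng.normal(loc=2.8, scale=1.5, size=1000)
# Stand-in for the same column after median imputation of 100 missing entries
imputed = np.concatenate([original, np.full(100, np.median(original))])

stat, p_value = ks_2samp(original, imputed)
# a small KS statistic (and a large p-value) means the imputed column's
# distribution stays close to the original one
print(stat, p_value)
```

The same call could be applied to each pair of real columns (original vs. SIC-imputed, original vs. KNN-imputed) to rank the two methods numerically.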
df_imputed_sic.describe()
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|---|
| count | 6967.000000 | 6967.000000 | 6967.000000 | 6967.000000 | 6967.000000 | 6967.000000 | 6967.000000 | 6967.000000 | 6967.00000 |
| mean | 0.452275 | 0.585044 | 43.515860 | 0.634563 | 2.322808 | 2.635568 | 0.539113 | 2.830558 | 1.54873 |
| std | 0.497753 | 0.492750 | 16.568029 | 0.481587 | 2.365200 | 3.407818 | 0.737329 | 1.507505 | 1.13002 |
| min | 0.000000 | 0.000000 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 |
| 25% | 0.000000 | 0.000000 | 31.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 1.00000 |
| 50% | 0.000000 | 1.000000 | 41.000000 | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 3.000000 | 2.00000 |
| 75% | 1.000000 | 1.000000 | 53.000000 | 1.000000 | 4.000000 | 4.000000 | 1.000000 | 4.000000 | 3.00000 |
| max | 1.000000 | 1.000000 | 89.000000 | 1.000000 | 8.000000 | 14.000000 | 2.000000 | 9.000000 | 3.00000 |
Now I break the pandas dataframe down into the X part (data features) and the y part (target classes, or class labels)
X = df_imputed_sic[['Gender','Ever_Married','Age','Graduated','Profession','Work_Experience','Spending_Score','Family_Size']]
y = df_imputed_sic['Segmentation'] # note the problem is NOT binary anymore: there are four classes!
print(X.shape)
print(y.shape)
print(type(X))
(6967, 8)
(6967,)
<class 'pandas.core.frame.DataFrame'>
X
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 22 | 0.0 | 1 | 1.0 | 0 | 4.0 |
| 2 | 1 | 1.0 | 67 | 1.0 | 3 | 1.0 | 0 | 1.0 |
| 3 | 0 | 1.0 | 67 | 1.0 | 5 | 0.0 | 2 | 2.0 |
| 5 | 0 | 1.0 | 56 | 0.0 | 0 | 0.0 | 1 | 2.0 |
| 6 | 0 | 0.0 | 32 | 1.0 | 1 | 1.0 | 0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8062 | 0 | 1.0 | 41 | 1.0 | 0 | 0.0 | 2 | 5.0 |
| 8064 | 0 | 0.0 | 35 | 0.0 | 6 | 3.0 | 0 | 4.0 |
| 8065 | 1 | 0.0 | 33 | 1.0 | 1 | 1.0 | 0 | 1.0 |
| 8066 | 1 | 0.0 | 27 | 1.0 | 1 | 1.0 | 0 | 4.0 |
| 8067 | 0 | 1.0 | 37 | 1.0 | 6 | 0.0 | 1 | 3.0 |
6967 rows × 8 columns
We need to reset the index of the pandas dataframe; otherwise the indices are not recognized in the regression part and an error is raised
X=X.reset_index(drop=True)
X
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 22 | 0.0 | 1 | 1.0 | 0 | 4.0 |
| 1 | 1 | 1.0 | 67 | 1.0 | 3 | 1.0 | 0 | 1.0 |
| 2 | 0 | 1.0 | 67 | 1.0 | 5 | 0.0 | 2 | 2.0 |
| 3 | 0 | 1.0 | 56 | 0.0 | 0 | 0.0 | 1 | 2.0 |
| 4 | 0 | 0.0 | 32 | 1.0 | 1 | 1.0 | 0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6962 | 0 | 1.0 | 41 | 1.0 | 0 | 0.0 | 2 | 5.0 |
| 6963 | 0 | 0.0 | 35 | 0.0 | 6 | 3.0 | 0 | 4.0 |
| 6964 | 1 | 0.0 | 33 | 1.0 | 1 | 1.0 | 0 | 1.0 |
| 6965 | 1 | 0.0 | 27 | 1.0 | 1 | 1.0 | 0 | 4.0 |
| 6966 | 0 | 1.0 | 37 | 1.0 | 6 | 0.0 | 1 | 3.0 |
6967 rows × 8 columns
y=y.reset_index(drop=True)
y
0 3
1 1
2 1
3 2
4 2
..
6962 1
6963 3
6964 3
6965 1
6966 1
Name: Segmentation, Length: 6967, dtype: int64
# Z-score-normalize the Age column of X
X_sic_normalized = pd.DataFrame(columns=['Age'])
X_sic_normalized['Age'] = (X['Age'] - X['Age'].mean()) / X['Age'].std()
header=['Gender','Ever_Married','Age','Graduated','Profession','Work_Experience','Spending_Score','Family_Size']
X_sic_normalization=pd.DataFrame( columns=header)
for i in header:
if i =='Age':
X_sic_normalization[i]=X_sic_normalized['Age']
else:
X_sic_normalization[i]=X[i]
X_sic_normalization
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | -1.298637 | 0.0 | 1 | 1.0 | 0 | 4.0 |
| 1 | 1 | 1.0 | 1.417437 | 1.0 | 3 | 1.0 | 0 | 1.0 |
| 2 | 0 | 1.0 | 1.417437 | 1.0 | 5 | 0.0 | 2 | 2.0 |
| 3 | 0 | 1.0 | 0.753508 | 0.0 | 0 | 0.0 | 1 | 2.0 |
| 4 | 0 | 0.0 | -0.695065 | 1.0 | 1 | 1.0 | 0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6962 | 0 | 1.0 | -0.151850 | 1.0 | 0 | 0.0 | 2 | 5.0 |
| 6963 | 0 | 0.0 | -0.513994 | 0.0 | 6 | 3.0 | 0 | 4.0 |
| 6964 | 1 | 0.0 | -0.634708 | 1.0 | 1 | 1.0 | 0 | 1.0 |
| 6965 | 1 | 0.0 | -0.996851 | 1.0 | 1 | 1.0 | 0 | 4.0 |
| 6966 | 0 | 1.0 | -0.393279 | 1.0 | 6 | 0.0 | 1 | 3.0 |
6967 rows × 8 columns
I used the PCA method for dimensionality reduction
def plot_explained_variance(pca):
plotly.offline.init_notebook_mode() # run at the start of every notebook
explained_var = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)
plotly.offline.iplot({
"data": [Bar(y=explained_var, name='individual explained variance'),
Scatter(y=cum_var_exp, name='cumulative explained variance')
],
"layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
})
n_components=8
pca = PCA(n_components=n_components)
X_pca = pca.fit(X_sic_normalization )
plot_explained_variance(pca)
This graph shows that the lower dimensions contain the main features of the data, which are essential for classification; increasing the dimensionality makes the dataset sparser and harder to classify. The graph also shows that roughly 96% of the variance is captured by the first 4 components.
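The cumulative explained variance can also be read off numerically from `np.cumsum` of `explained_variance_ratio_`; a sketch on synthetic stand-in data (the toy numbers will not reproduce the 96% figure above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 8 normalized features; the real cumulative
# variance comes from the lab's own PCA fit, not from this toy data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))

pca_demo = PCA(n_components=8).fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)
print(cum_var)  # cum_var[3] is the fraction captured by the first 4 components
```

On the lab's data, `cum_var[3]` is what the interactive plot summarizes as approximately 96%.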
pca = PCA(n_components=4)
pca.fit(X_sic_normalization) # fit data and then transform it
X_pca = pca.transform(X_sic_normalization)
print ('pca:\n', pca.components_)
X_pca
pca:
 [[ 0.00879734 -0.01548101 -0.05935805  0.00428559  0.02634117  0.99694058 -0.01912948 -0.03450917]
 [-0.00338449  0.02673478  0.09718793 -0.03891447  0.99149056 -0.01864461  0.06988248 -0.00540943]
 [-0.01596247 -0.06700512 -0.30405119 -0.09482012  0.03344638  0.01316573 -0.00298034  0.94473099]
 [-0.04042977  0.3360445   0.71405018  0.12229177 -0.10831756  0.06975675  0.52082467  0.26973979]]
array([[-1.61600824, -1.4406857 , 1.54207987, -1.1193564 ],
[-1.62341735, 0.80692917, -2.22883376, 0.21210911],
[-2.64924108, 2.94629491, -1.22037394, 1.27753614],
...,
[-1.53880734, -1.40223046, -1.6047642 , -1.37263491],
[-1.62083874, -1.4536547 , 1.33953887, -0.82200401],
[-2.53079901, 3.68651379, 0.31133423, -0.37480864]])
sns.set(style="darkgrid")
cmap = 'viridis' # sns.set returns None, so pick an explicit colormap for the scatter plots below
# this function definition just formats the weights into readable strings
# you can skip it without loss of generality to the Data Science content
def get_feature_names_from_weights(weights, names):
tmp_array = []
for comp in weights:
tmp_string = ''
for fidx,f in enumerate(names):
if fidx>0 and comp[fidx]>=0:
tmp_string+='+'
tmp_string += '%.2f*%s ' % (comp[fidx], f) # use the full feature name so the labels stay readable
tmp_array.append(tmp_string)
return tmp_array
plt.style.use('default')
# now let's get to the Data Analytics!
data=['Gender','Ever_Married','Age','Graduated','Profession','Work_Experience','Spending_Score','Family_Size']
pca_weight_strings = get_feature_names_from_weights(pca.components_, data)
# create some pandas dataframes from the transformed outputs
df_pca = pd.DataFrame(X_pca,columns=[pca_weight_strings])
from matplotlib.pyplot import scatter
# scatter plot the output, with the names created from the weights
ax = scatter(X_pca[:,1], X_pca[:,3], c=y, s=(y+2)*10, cmap=cmap)
plt.xlabel(pca_weight_strings[1])
plt.ylabel(pca_weight_strings[3])
fig = plt.figure()
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') is deprecated in newer matplotlib
X_norm = (X - X.min())/(X.max() - X.min())
pca = sklearnPCA(n_components=4) # 4-component PCA
transformed = pd.DataFrame(pca.fit_transform(X_norm))
# Segmentation was encoded as 0-3 (A-D), so filter on those codes, and plot on
# the 3D axes so the third component maps to depth rather than marker size
ax.scatter(transformed[y==0][0], transformed[y==0][1], transformed[y==0][2], label='Class A', c='red')
ax.scatter(transformed[y==1][0], transformed[y==1][1], transformed[y==1][2], label='Class B', c='blue')
ax.scatter(transformed[y==2][0], transformed[y==2][1], transformed[y==2][2], label='Class C', c='lightgreen')
ax.scatter(transformed[y==3][0], transformed[y==3][1], transformed[y==3][2], label='Class D', c='k')
ax.legend()
Linear Discriminant Analysis (LDA) is a supervised classification algorithm used to find a linear combination of features that best separates two or more classes of data. It assumes that the data follows a Gaussian distribution and that the covariance matrix of each class is equal.
The goal of LDA is to reduce the dimensionality of the data while preserving as much of the class discriminatory information as possible. It accomplishes this by finding the projection that maximizes the ratio of the between-class variance to the within-class variance.
LDA can be used for both binary and multi-class classification problems. It is commonly used in pattern recognition, face recognition, and bioinformatics.
One of the advantages of LDA is that it can handle high-dimensional data, making it useful in situations where the number of features is larger than the number of samples. LDA can also provide insights into the underlying structure of the data by revealing the most discriminative features.
However, LDA assumes that the classes are normally distributed and have equal covariance matrices, which may not always be the case in real-world data. Additionally, LDA may not perform well when the classes are heavily overlapped or when there are too few samples per class.
lda = LDA(n_components=3) # at most (number of classes - 1) = 3 components
lda_transformed = pd.DataFrame(lda.fit_transform(X_norm, y))
fig = plt.figure()
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') is deprecated in newer matplotlib
# Segmentation codes are 0-3 (A-D)
ax.scatter(lda_transformed[y==0][0], lda_transformed[y==0][1], lda_transformed[y==0][2], label='Class A', c='red')
ax.scatter(lda_transformed[y==1][0], lda_transformed[y==1][1], lda_transformed[y==1][2], label='Class B', c='blue')
ax.scatter(lda_transformed[y==2][0], lda_transformed[y==2][1], lda_transformed[y==2][2], label='Class C', c='lightgreen')
ax.scatter(lda_transformed[y==3][0], lda_transformed[y==3][1], lda_transformed[y==3][2], label='Class D', c='k')
ax.legend()
plt.tight_layout()
This graph shows the spread of our data in a 3D plot
=====================================================
Divide the data into training and testing sets using an 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8)
X_train=X_train.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)
X_test=X_test.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y.shape)
print(X.shape)
(5573, 8)
(5573,)
(1394, 8)
(1394,)
(6967,)
(6967, 8)
X_train.describe()
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size |
|---|---|---|---|---|---|---|---|---|
| count | 5573.000000 | 5573.000000 | 5573.000000 | 5573.000000 | 5573.000000 | 5573.000000 | 5573.000000 | 5573.000000 |
| mean | 0.448412 | 0.588911 | 43.682577 | 0.632334 | 2.353131 | 2.610802 | 0.542257 | 2.816526 |
| std | 0.497376 | 0.492076 | 16.578532 | 0.482213 | 2.378680 | 3.379258 | 0.739766 | 1.503385 |
| min | 0.000000 | 0.000000 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 31.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 0.000000 | 1.000000 | 41.000000 | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 2.000000 |
| 75% | 1.000000 | 1.000000 | 53.000000 | 1.000000 | 4.000000 | 4.000000 | 1.000000 | 4.000000 |
| max | 1.000000 | 1.000000 | 89.000000 | 1.000000 | 8.000000 | 14.000000 | 2.000000 | 9.000000 |
X_test.describe()
| | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size |
|---|---|---|---|---|---|---|---|---|
| count | 1394.000000 | 1394.000000 | 1394.000000 | 1394.000000 | 1394.000000 | 1394.000000 | 1394.000000 | 1394.000000 |
| mean | 0.467719 | 0.569584 | 42.849354 | 0.643472 | 2.201578 | 2.734577 | 0.526542 | 2.886657 |
| std | 0.499136 | 0.495312 | 16.515093 | 0.479146 | 2.307383 | 3.519181 | 0.727630 | 1.523115 |
| min | 0.000000 | 0.000000 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 30.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 0.000000 | 1.000000 | 40.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 |
| 75% | 1.000000 | 1.000000 | 52.000000 | 1.000000 | 4.000000 | 5.000000 | 1.000000 | 4.000000 |
| max | 1.000000 | 1.000000 | 89.000000 | 1.000000 | 8.000000 | 14.000000 | 2.000000 | 9.000000 |
y_train.describe()
count    5573.000000
mean        1.545308
std         1.129438
min         0.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         3.000000
Name: Segmentation, dtype: float64
y_test.describe()
count    1394.000000
mean        1.562410
std         1.132649
min         0.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         3.000000
Name: Segmentation, dtype: float64
pca = PCA(n_components=4)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(5573, 4)
(5573,)
(1394, 4)
(1394,)
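As a quick sanity check on the PCA step (not part of the original lab output), one can inspect how much variance the four retained components explain. This sketch uses random stand-in data, since `X_train` itself is not reproduced here:

```python
import numpy as np
from sklearn.decomposition import PCA

# toy stand-in for X_train: 200 samples, 8 standardized features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))

pca = PCA(n_components=4)
pca.fit(X_demo)

# per-component and cumulative explained variance ratios
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
```

If the cumulative ratio is low, four components may be discarding too much signal, which would help explain a ceiling on downstream accuracy.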
The size of the dataset plays an important role in choosing a splitting strategy. If the dataset is small, a separate training/testing split may leave too little data for training the model; this dataset (6,967 rows) is large enough to support one. The split is random, ensuring the model is not biased toward a particular subset of the data, and the classes appear nearly balanced between the training and testing sets. If the classes were imbalanced, the model could be biased toward the majority class and perform poorly on the minority class. Since the training and testing sets also share nearly identical statistical characteristics, an 80/20 split works well for this data.
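Rather than relying on the random split happening to balance the classes, `train_test_split` can guarantee it via `stratify=y`. A minimal sketch on toy labels (the `y_demo` array is a stand-in for the Segmentation column):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy 4-class labels (assumption: stands in for the Segmentation column)
y_demo = np.repeat([0, 1, 2, 3], [300, 250, 250, 200])
X_demo = np.random.default_rng(1).normal(size=(len(y_demo), 8))

# stratify=y keeps the class proportions identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

print(np.bincount(y_tr) / len(y_tr))  # training class proportions
print(np.bincount(y_te) / len(y_te))  # test class proportions
```

Both printed proportion vectors match exactly, which removes class imbalance between splits as a source of evaluation noise.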
bars = ('A', 'B', 'C', 'D')
plt.subplot(1,2,1)
plt.bar(bars, y_train.value_counts().sort_index()) # sort by label so counts line up with A-D
plt.xlabel('class')
plt.ylabel('count')
plt.title('training target')
plt.subplot(1,2,2)
plt.bar(bars, y_test.value_counts().sort_index())
plt.xlabel('class')
plt.ylabel('count')
plt.title('test target')
plt.tight_layout()
Text(0.5, 1.0, 'test target')
Logistic regression was originally designed for binary classification tasks, where the goal is to predict a binary outcome (e.g. yes/no, true/false). However, it can also be used for multiclass classification tasks by extending the binary decision framework.
One way to extend logistic regression for multiclass classification is to use the one-vs-all (OvA) or one-vs-rest (OvR) strategy. In this approach, the multiclass classification problem is transformed into several binary classification problems. Specifically, for each class k, a binary classifier is trained to distinguish between observations of class k and observations of all other classes combined.
For example, suppose we have $K$ classes $(K>2)$ and we want to use logistic regression for multiclass classification. We can train $K$ binary classifiers, denoted by $g_1(x), g_2(x),..., g_K(x)$, where each classifier $g_k$ predicts whether an observation $x$ belongs to class $k$ or not. The decision rule for the OvR strategy is to assign the class label with the highest predicted probability among all K classifiers:
$$ \widehat{y} = \underset{k=1,\dots,K}{\operatorname{argmax}} \; g(w^{T}_k x)$$
The probability that an observation $x$ belongs to class $k$ is estimated using the logistic function:
$$ P(y=k|x)= \frac{1}{1+\exp(-w_k^T x)}$$
where $w_k$ is the weight vector for the $k$-th classifier.
The weight vector $w_k$ for the $k_{th}$ classifier is learned by minimizing the logistic regression loss function:
$$ J(w_k) = -\frac{1}{n} \sum_{i=1}^{n}\Big[ y_i \log\big(g(w^{T}_k x_i)\big) + (1-y_i)\log\big(1-g(w^{T}_k x_i)\big) \Big]$$
where $y_i=1$ if the $i$-th observation belongs to class $k$, and $y_i=0$ otherwise. The optimization problem can be solved using gradient descent or other optimization algorithms.
In summary, logistic regression can be extended for multiclass classification using the OvA or OvR strategy, which involves training multiple binary classifiers and combining their outputs to make a multiclass decision.
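The OvR decision rule can be made concrete with toy numbers. Here `W` holds three hypothetical weight vectors $w_k$ (bias weight first, values invented for illustration), and the predicted class is simply the argmax over the $K$ sigmoid outputs:

```python
import numpy as np
from scipy.special import expit

# hypothetical weights for K=3 one-vs-rest classifiers (bias weight first)
W = np.array([[-1.0,  2.0,  0.5],   # w_0: class 0 vs rest
              [ 0.2, -1.0,  1.0],   # w_1: class 1 vs rest
              [-0.5,  0.3, -2.0]])  # w_2: class 2 vs rest
x = np.array([1.0, 0.8, -0.4])      # one observation, with bias term prepended

probs = expit(W @ x)      # P(y=k|x) = sigmoid(w_k^T x) for each k
y_hat = np.argmax(probs)  # OvR decision rule: highest probability wins
print(probs, y_hat)
```

Note that the per-class probabilities need not sum to 1, since each binary classifier is trained independently; only their relative order matters for the argmax.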
%%time
# from last time, our logistic regression algorithm is given by (including everything we previously had):
class BinaryLogisticRegression:
    def __init__(self, eta, iterations=20, C=0.001, l1=0, l2=0):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.l1 = l1
        self.l2 = l2
        # internally we will store the weights as self.w_ to keep with sklearn conventions

    def __str__(self):
        if hasattr(self, 'w_'):
            return 'Binary Logistic Regression Object with coefficients:\n' + str(self.w_) # if we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'

    # convenience, private:
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)), X)) # add bias term

    @staticmethod
    def _sigmoid(theta):
        # for numerical stability, use scipy's sigmoid instead of 1/(1+np.exp(-theta))
        return expit(theta)

    # vectorized gradient calculation with L1/L2 regularization
    def _get_gradient(self, X, y):
        ydiff = y - self.predict_proba(X, add_bias=False).ravel() # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.l2*(-2 * self.w_[1:] * self.C) + self.l1*(-np.sign(self.w_[1:]) * self.C)
        return gradient

    # public:
    def predict_proba(self, X, add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1

    def predict(self, X):
        return (self.predict_proba(X) > 0.5) # return the actual prediction

    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb, y)
            self.w_ += gradient * self.eta # add because we are maximizing the log-likelihood
Wall time: 0 ns
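A side note on the `_sigmoid` helper above: routing through `scipy.special.expit` avoids the overflow that the textbook formula `1/(1+np.exp(-theta))` hits on large negative inputs, where `np.exp` produces `inf` intermediates (and a RuntimeWarning) before collapsing to the boundary value. A small check:

```python
import numpy as np
from scipy.special import expit

theta = np.array([-1000.0, 0.0, 1000.0])

# naive sigmoid: np.exp(1000) overflows to inf internally
with np.errstate(over='ignore'):
    naive = 1.0 / (1.0 + np.exp(-theta))

stable = expit(theta)  # saturates cleanly to [0, 1] with no overflow
print(naive, stable)
```

Here the final values happen to coincide, but `expit` computes them without the overflowing intermediate, which matters inside gradient loops that run thousands of times.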
%%time
class StochasticLogisticRegression(BinaryLogisticRegression):
    # stochastic gradient calculation: use a single random instance per update
    def _get_gradient(self, X, y):
        idx = int(np.random.rand()*len(y)) # grab a random instance
        ydiff = y[idx] - self.predict_proba(X[idx], add_bias=False) # get y difference (now scalar)
        gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.l2*(-2 * self.w_[1:] * self.C) + self.l1*(-np.sign(self.w_[1:]) * self.C)
        return gradient
Wall time: 0 ns
%%time
from numpy.linalg import pinv

class HessianBinaryLogisticRegression(BinaryLogisticRegression):
    # just overwrite the gradient function to take a Newton step
    def _get_gradient(self, X, y):
        g = self.predict_proba(X, add_bias=False).ravel() # sigmoid value for every sample
        hessian = X.T @ np.diag(g*(1-g)) @ X - 2 * self.C # calculate the Hessian
        ydiff = y - g # get y difference
        gradient = np.sum(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.l2*(-2 * self.w_[1:] * self.C) + self.l1*(-np.sign(self.w_[1:]) * self.C)
        return pinv(hessian) @ gradient

hlr = HessianBinaryLogisticRegression(eta=1.0,
                                      iterations=4,
                                      C=0.001) # only a few iterations needed with Newton steps
Wall time: 0 ns
%%time
# for this, we won't perform our own BFGS implementation
# (it takes a fair amount of code and understanding, which we haven't setup yet)
# luckily for us, scipy has its own BFGS implementation:
from scipy.optimize import fmin_bfgs # maybe the most common bfgs algorithm in the world
from numpy import ma

class BFGSBinaryLogisticRegression(BinaryLogisticRegression):
    @staticmethod
    def objective_function(w, X, y, C, l1, l2):
        g = expit(X @ w)
        # negate because scipy minimizes, but we derived all formulas for maximizing
        return -np.sum(ma.log(g[y==1])) - np.sum(ma.log(1-g[y==0])) + l2*C*sum(w**2) + l1*C*sum(abs(w))

    @staticmethod
    def objective_gradient(w, X, y, C, l1, l2):
        g = expit(X @ w)
        ydiff = y - g # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
        gradient = gradient.reshape(w.shape)
        gradient[1:] += -2*l2*w[1:]*C - l1*np.sign(w[1:])*C
        return -gradient

    # just overwrite the fit function
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        self.w_ = fmin_bfgs(self.objective_function,        # what to optimize
                            np.zeros((num_features,1)),     # starting point
                            fprime=self.objective_gradient, # gradient function
                            args=(Xb, y, self.C, self.l1, self.l2), # extra args for gradient and objective function
                            gtol=1e-03,         # stopping criterion on the gradient norm |v_k|
                            maxiter=self.iters, # stopping criterion on iterations
                            disp=False)
        self.w_ = self.w_.reshape((num_features,1))
Wall time: 0 ns
# allow the user to specify the algorithm used to solve each binary subproblem
class MultiClassLogisticRegression:
    def __init__(self, eta, iterations=20,
                 C=0.0001, l1=0, l2=0,
                 solver=BFGSBinaryLogisticRegression):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.l1 = l1
        self.l2 = l2
        self.solver = solver
        self.classifiers_ = []
        # internally we will store the weights as self.w_ to keep with sklearn conventions

    def __str__(self):
        if hasattr(self, 'w_'):
            return 'MultiClass Logistic Regression Object with coefficients:\n' + str(self.w_) # if we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'

    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.unique_ = np.sort(np.unique(y)) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []
        for i, yval in enumerate(self.unique_): # for each unique value
            y_binary = np.array(y==yval).astype(int) # create a one-vs-rest binary problem
            # train the binary classifier for this class
            hblr = self.solver(eta=self.eta, iterations=self.iters, C=self.C, l1=self.l1, l2=self.l2)
            hblr.fit(X, y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(hblr)
        # save all the weights into one matrix, separate row for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T

    def predict_proba(self, X):
        probs = []
        for hblr in self.classifiers_:
            probs.append(hblr.predict_proba(X).reshape((len(X),1))) # get probability for each classifier
        return np.hstack(probs) # make into single matrix

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1) # take argmax along each row
lr = MultiClassLogisticRegression(eta=1,
iterations=10,
C=0.01,l1=0,l2=0,
solver=BFGSBinaryLogisticRegression
)
lr.fit(X_train,y_train)
print(lr)
yhat_test = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat_test))
#print(y_test,yhat_test)
MultiClass Logistic Regression Object with coefficients:
[[-1.1785995   0.00562273  0.04241971  0.03507137 -0.27500329]
 [-1.2139377   0.02301559 -0.00858972 -0.04869755 -0.00510437]
 [-1.22678139  0.02822558 -0.03963789 -0.19303323  0.23463552]
 [-1.32311832 -0.07246282  0.00157676  0.15239195  0.02217994]]
Accuracy of:  0.45911047345767575
lr = MultiClassLogisticRegression(eta=1,
iterations=10,
C=0.01,l1=0,l2=0,
solver=StochasticLogisticRegression
)
lr.fit(X_train,y_train)
print(lr)
yhat_test = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat_test))
MultiClass Logistic Regression Object with coefficients:
[[ -1.48840484  15.17633973  -8.1413294   -4.50602518  -3.65019185]
 [ -7.5          3.65080696   1.33965755   3.82366282   3.84252444]
 [ -4.49960281   5.66123662   0.16340008   0.80679485  -1.67336826]
 [ -3.49766804 -19.83294088  -2.15871557   0.81914914  -7.31826309]]
Accuracy of:  0.32711621233859395
lr = MultiClassLogisticRegression(eta=1,
iterations=10,
C=0.01,l1=0,l2=0,
solver=BinaryLogisticRegression
)
lr.fit(X_train,y_train)
print(lr)
yhat_test = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat_test))
MultiClass Logistic Regression Object with coefficients:
[[-2.45060412 -3.55589714  0.68254858  0.35913704 -0.89150568]
 [-2.45697309  5.44049946 -0.09314647 -0.31308211 -0.04246789]
 [-2.32635668  6.85873796 -0.32377818 -1.2218655   0.69583521]
 [-2.38046315 -8.17809216 -0.10465903  1.04907213  0.20589703]]
Accuracy of:  0.2374461979913917
%%time
lr = MultiClassLogisticRegression(eta=1,
iterations=10,
C=0.005,
l1=1,
l2=0,
solver=HessianBinaryLogisticRegression
)
lr.fit(X_train,y_train)
print(lr)
yhat_test = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat_test))
MultiClass Logistic Regression Object with coefficients:
[[-1.17841884  0.00562162  0.04243464  0.03505603 -0.27502804]
 [-1.21394557  0.02301567 -0.00858804 -0.04870027 -0.00520449]
 [-1.22651017  0.02821202 -0.03970419 -0.19312131  0.23465296]
 [-1.32255866 -0.07242697  0.00155623  0.15245491  0.02203938]]
Accuracy of:  0.45911047345767575
Wall time: 8.08 s
%%time
# how do we compare now to sklearn?
from sklearn.linear_model import LogisticRegression
lr_sk = LogisticRegression(solver='lbfgs', n_jobs=1,
                           multi_class='ovr', C=1/0.001,
                           penalty=None, # unpenalized; the string 'none' was removed in scikit-learn 1.2
                           max_iter=50)
# note that sklearn is optimized for using the liblinear library with logistic regression
# ...and its faster than our implementation here
lr_sk.fit(X_train, y_train) # no need to add bias term, sklearn does it internally!!
print(lr_sk.coef_)
yhat = lr_sk.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
[[ 0.00562171  0.04243569  0.03505756 -0.27503076]
 [ 0.023016   -0.00858744 -0.04870584 -0.00511351]
 [ 0.02821203 -0.03970437 -0.19312147  0.23465619]
 [-0.07242703  0.00155564  0.15245498  0.02203773]]
Accuracy of:  0.45911047345767575
Wall time: 87.1 ms
%%time
# how do we compare now to sklearn?
from sklearn.linear_model import LogisticRegression
lr_sk = LogisticRegression(solver='lbfgs',n_jobs=1,
multi_class='ovr', C = 1/0.001,
penalty='l2',
max_iter=50) # all params default
# note that sklearn is optimized for using the liblinear library with logistic regression
# ...and its faster than our implementation here
lr_sk.fit(X_train, y_train) # no need to add bias term, sklearn does it internally!!
print(lr_sk.coef_)
yhat = lr_sk.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
[[ 0.00562171  0.04243569  0.03505756 -0.27503059]
 [ 0.023016   -0.00858744 -0.04870583 -0.00511351]
 [ 0.02821203 -0.03970436 -0.19312141  0.23465606]
 [-0.07242703  0.00155564  0.15245495  0.02203771]]
Accuracy of:  0.43100663915305937
Wall time: 92.6 ms
%%time
# how do we compare now to sklearn?
from sklearn.linear_model import LogisticRegression
lr_sk = LogisticRegression(solver='liblinear',n_jobs=1,
multi_class='ovr', C = 1/0.001,
penalty='l1',
max_iter=50) # all params default
# note that sklearn is optimized for using the liblinear library with logistic regression
# ...and its faster than our implementation here
lr_sk.fit(X_train, y_train) # no need to add bias term, sklearn does it internally!!
print(lr_sk.coef_)
yhat = lr_sk.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
[[ 0.00562157  0.04243309  0.03505698 -0.27501492]
 [ 0.02301529 -0.00858873 -0.048697   -0.00520541]
 [ 0.02821102 -0.03970332 -0.19311551  0.23464123]
 [-0.07241713  0.00155002  0.15242     0.0220464 ]]
Accuracy of:  0.43100663915305937
Wall time: 23.5 ms
%%time
# how do we compare now to sklearn?
from sklearn.linear_model import LogisticRegression
lr_sk = LogisticRegression(solver='saga',n_jobs=1,
multi_class='ovr', C = 1/0.001,
penalty='elasticnet',max_iter=1000, l1_ratio=0.5) # all params default
# note that sklearn is optimized for using the liblinear library with logistic regression
# ...and its faster than our implementation here
lr_sk.fit(X_train, y_train) # no need to add bias term, sklearn does it internally!!
print(lr_sk.coef_)
yhat = lr_sk.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
[[ 0.00561907  0.04243561  0.03506242 -0.27505093]
 [ 0.02301236 -0.00859006 -0.04870284 -0.00520592]
 [ 0.02820901 -0.03970652 -0.19311903  0.23465027]
 [-0.07242725  0.00155761  0.15245605  0.02204254]]
Accuracy of:  0.43100663915305937
Wall time: 133 ms
To compare the performance of the regression methods and find good values for the tuning parameters, I plot the accuracy of five solvers ('BFGSBinary', 'Stochastic', 'Binary', 'Hessian', 'SKlearn'). Each time I vary just one parameter, take the value that maximizes accuracy, and fix it for the next sweep. I start by varying eta:
Eta=[0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5,10]
Accuracy_Score=[]
Accuracy_Score1=[]
Accuracy_Score2=[]
Accuracy_Score3=[]
Accuracy_Score4=[]
for i in Eta:
    lr = MultiClassLogisticRegression(eta=i,
                                      iterations=10,
                                      C=0.01, l1=1, l2=1,
                                      solver=BFGSBinaryLogisticRegression)
    lr.fit(X_train, y_train)
    yhat_test = lr.predict(X_test)
    Accuracy_Score.append(accuracy_score(y_test, yhat_test))

    lr1 = MultiClassLogisticRegression(eta=i,
                                       iterations=10,
                                       C=0.01, l1=1, l2=1,
                                       solver=StochasticLogisticRegression)
    lr1.fit(X_train, y_train)
    yhat_test1 = lr1.predict(X_test)
    Accuracy_Score1.append(accuracy_score(y_test, yhat_test1))

    lr2 = MultiClassLogisticRegression(eta=i,
                                       iterations=10,
                                       C=0.01, l1=1, l2=1,
                                       solver=BinaryLogisticRegression)
    lr2.fit(X_train, y_train)
    yhat_test2 = lr2.predict(X_test)
    Accuracy_Score2.append(accuracy_score(y_test, yhat_test2))

    lr3 = MultiClassLogisticRegression(eta=i,
                                       iterations=10,
                                       C=0.01, l1=1, l2=1,
                                       solver=HessianBinaryLogisticRegression)
    lr3.fit(X_train, y_train)
    yhat_test3 = lr3.predict(X_test)
    Accuracy_Score3.append(accuracy_score(y_test, yhat_test3))

    lr_sk = LogisticRegression(solver='saga', n_jobs=1,
                               multi_class='ovr', C=1/0.01,
                               penalty='elasticnet', max_iter=10, l1_ratio=0.5)
    lr_sk.fit(X_train, y_train) # no need to add a bias term, sklearn does it internally
    yhat_test4 = lr_sk.predict(X_test)
    Accuracy_Score4.append(accuracy_score(y_test, yhat_test4))

e = ['0.0001','0.0005','0.001','0.005','0.01','0.05','0.1','0.5','1','5','10']
plt.plot(e, Accuracy_Score, e, Accuracy_Score1, e, Accuracy_Score2, e, Accuracy_Score3, e, Accuracy_Score4)
plt.legend(['BFGSBinary','Stochastic','Binary','Hessian','SKlearn'])
plt.title('changing eta: effect on classification accuracy')
Text(0.5, 1.0, 'changing eta effect on classification accuracy')
The result shows that most of the methods reach their maximum accuracy around eta=0.01, so for the next sweep I fix eta=0.01 and investigate the effect of changing C.
c=[0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5,10]
Accuracy_Score=[]
Accuracy_Score1=[]
Accuracy_Score2=[]
Accuracy_Score3=[]
Accuracy_Score4=[]
for i in c:
    lr = MultiClassLogisticRegression(eta=0.01,
                                      iterations=10,
                                      C=i, l1=1, l2=1,
                                      solver=BFGSBinaryLogisticRegression)
    lr.fit(X_train, y_train)
    yhat_test = lr.predict(X_test)
    Accuracy_Score.append(accuracy_score(y_test, yhat_test))

    lr1 = MultiClassLogisticRegression(eta=0.01,
                                       iterations=10,
                                       C=i, l1=1, l2=1,
                                       solver=StochasticLogisticRegression)
    lr1.fit(X_train, y_train)
    yhat_test1 = lr1.predict(X_test)
    Accuracy_Score1.append(accuracy_score(y_test, yhat_test1))

    lr2 = MultiClassLogisticRegression(eta=0.01,
                                       iterations=10,
                                       C=i, l1=1, l2=1,
                                       solver=BinaryLogisticRegression)
    lr2.fit(X_train, y_train)
    yhat_test2 = lr2.predict(X_test)
    Accuracy_Score2.append(accuracy_score(y_test, yhat_test2))

    lr3 = MultiClassLogisticRegression(eta=0.01,
                                       iterations=10,
                                       C=i, l1=1, l2=1,
                                       solver=HessianBinaryLogisticRegression)
    lr3.fit(X_train, y_train)
    yhat_test3 = lr3.predict(X_test)
    Accuracy_Score3.append(accuracy_score(y_test, yhat_test3))

    lr_sk = LogisticRegression(solver='saga', n_jobs=1,
                               multi_class='ovr', C=1/i,
                               penalty='elasticnet', max_iter=10, l1_ratio=0.5)
    lr_sk.fit(X_train, y_train) # no need to add a bias term, sklearn does it internally
    yhat_test4 = lr_sk.predict(X_test)
    Accuracy_Score4.append(accuracy_score(y_test, yhat_test4))

e = ['0.0001','0.0005','0.001','0.005','0.01','0.05','0.1','0.5','1','5','10']
plt.plot(e, Accuracy_Score, e, Accuracy_Score1, e, Accuracy_Score2, e, Accuracy_Score3, e, Accuracy_Score4)
plt.legend(['BFGSBinary','Stochastic','Binary','Hessian','SKlearn'])
plt.title('changing C: effect on classification accuracy')
Text(0.5, 1.0, 'changing C effect on classification accuracy')
The result shows that most of the methods reach their maximum accuracy around C=0.001, so for the final sweep I fix eta=0.01 and C=0.001 and investigate the effect of toggling the l1 and l2 regularization terms.
T=[(0,0),(0,1),(1,0),(1,1)]
Accuracy_Score=[]
Accuracy_Score1=[]
Accuracy_Score2=[]
Accuracy_Score3=[]
Accuracy_Score4=[]
for (i,j) in T:
    lr = MultiClassLogisticRegression(eta=0.01,
                                      iterations=10,
                                      C=0.001, l1=i, l2=j,
                                      solver=BFGSBinaryLogisticRegression)
    lr.fit(X_train, y_train)
    yhat_test = lr.predict(X_test)
    Accuracy_Score.append(accuracy_score(y_test, yhat_test))

    lr1 = MultiClassLogisticRegression(eta=0.01,
                                       iterations=10,
                                       C=0.001, l1=i, l2=j,
                                       solver=StochasticLogisticRegression)
    lr1.fit(X_train, y_train)
    yhat_test1 = lr1.predict(X_test)
    Accuracy_Score1.append(accuracy_score(y_test, yhat_test1))

    lr2 = MultiClassLogisticRegression(eta=0.01,
                                       iterations=10,
                                       C=0.001, l1=i, l2=j,
                                       solver=BinaryLogisticRegression)
    lr2.fit(X_train, y_train)
    yhat_test2 = lr2.predict(X_test)
    Accuracy_Score2.append(accuracy_score(y_test, yhat_test2))

    lr3 = MultiClassLogisticRegression(eta=0.01,
                                       iterations=10,
                                       C=0.001, l1=i, l2=j,
                                       solver=HessianBinaryLogisticRegression)
    lr3.fit(X_train, y_train)
    yhat_test3 = lr3.predict(X_test)
    Accuracy_Score3.append(accuracy_score(y_test, yhat_test3))

    lr_sk = LogisticRegression(solver='saga', n_jobs=1,
                               multi_class='ovr', C=1/0.001,
                               penalty='elasticnet', max_iter=10, l1_ratio=0.5)
    lr_sk.fit(X_train, y_train) # no need to add a bias term, sklearn does it internally
    yhat_test4 = lr_sk.predict(X_test)
    Accuracy_Score4.append(accuracy_score(y_test, yhat_test4))

t = ['(0,0)','(0,1)','(1,0)','(1,1)']
plt.plot(t, Accuracy_Score, t, Accuracy_Score1, t, Accuracy_Score2, t, Accuracy_Score3, t, Accuracy_Score4)
plt.legend(['BFGSBinary','Stochastic','Binary','Hessian','SKlearn'])
plt.title('changing l1,l2: effect on classification accuracy')
Text(0.5, 1.0, 'changing l1,l2 effect on classification accuracy')
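The one-parameter-at-a-time sweeps above can also be automated with scikit-learn's `GridSearchCV`, which cross-validates every combination of a parameter grid instead of fixing one value before exploring the next. A sketch on synthetic data (the grid values and dataset are illustrative, not the lab's tuned ones):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic 4-class problem standing in for the segmentation data
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, n_classes=4,
                                     random_state=0)

# illustrative grid: joint search over C and the L1/L2 mix
param_grid = {'C': [0.1, 1.0, 10.0], 'l1_ratio': [0.0, 0.5, 1.0]}
base = LogisticRegression(solver='saga', penalty='elasticnet', max_iter=2000)
search = GridSearchCV(base, param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The joint search avoids the trap of the greedy sweep, where the best eta at one C may not be the best eta at another.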
The results show a maximum accuracy of about 46.5% using the sklearn method, with BFGSBinary performing next best. Hessian-based logistic regression can be sensitive to small changes in the data; when the curvature estimate is poor, the Newton steps can be biased. Stochastic logistic regression is noisier than batch logistic regression because it updates the parameters from single instances, which introduces extra variance; it is also more sensitive to the initial parameter values and the learning rate, and the random updates can cause oscillations in performance if these are not set carefully.
Scikit-learn's implementation generally performs best on our dataset and is the most accurate here. Its advantages are stable, high performance that does not degrade sharply as hyperparameters change, a fast and computationally efficient implementation that handles large datasets with ease, and built-in regularization to prevent overfitting and improve generalization. It applies to a wide range of problems, including binary and multi-class classification. Overall, scikit-learn's logistic regression is a reliable and versatile algorithm, and taking all of these aspects into consideration, I advise using it.
To optimize logistic regression with both L1 and L2 regularization using mean squared error as the objective function, we define the loss function (to be minimized, so the penalties are added) as:
$L(w) = \frac{1}{2n}\sum_{i=1}^{n}\big(y^{(i)}-g(w^{T}x^{(i)})\big)^2 + l_1 C\sum_{j=1}^{m}|w_j| + l_2 C \sum_{j=1}^{m}w_j^2$
where $l_1$ and $l_2$ are the regularization parameters that control the strength of L1 and L2 regularization, respectively, and $m$ is the number of features in the data. The first term is the mean squared error, the second is the L1 regularization term that promotes sparsity in the weight vector, and the third is the L2 regularization term that penalizes large weights.
The gradient of the loss function with respect to the weight vector $w$ is:
$\nabla L(w) = \frac{1}{n}\sum_{i=1}^{n}\big(g(w^{T}x^{(i)})-y^{(i)}\big)\,g(w^{T}x^{(i)})\big(1-g(w^{T}x^{(i)})\big)\,x^{(i)} + l_1 C\,\text{sign}(w) + 2\,l_2 C\, w$
where $\text{sign}(w)$ is the sign function that returns 1 if $w$ is positive, -1 if $w$ is negative, and 0 if $w$ is zero.
The Hessian of the loss function, writing $g_k = g(w^{T}x^{(k)})$ for brevity, is:
$H_{ij} = \frac{\partial^2 L(w)}{\partial w_i\partial w_j} = \frac{1}{n}\sum_{k=1}^{n} g_k(1-g_k)\big(-3g_k^{2}+2g_k+y^{(k)}(2g_k-1)\big)\,x^{(k)}_i x^{(k)}_j + 2\,l_2 C\,\delta_{ij}$
where $\delta_{ij}$ is the Kronecker delta function, which is 1 when $i=j$ and 0 otherwise.
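The gradient expression can be checked numerically against central finite differences on a toy problem (regularization switched off, i.e. $l_1=l_2=0$, to keep the sketch short):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) > 0.5).astype(float)
w = rng.normal(size=3)

def mse_loss(w):
    # L(w) = 1/(2n) * sum (y - g)^2, no regularization
    g = expit(X @ w)
    return np.mean((y - g) ** 2) / 2

def mse_grad(w):
    # analytic gradient: 1/n * sum (g - y) g (1 - g) x
    g = expit(X @ w)
    return (X.T @ ((g - y) * g * (1 - g))) / len(y)

# central finite differences along each coordinate
eps = 1e-6
num = np.array([(mse_loss(w + eps*e) - mse_loss(w - eps*e)) / (2*eps)
                for e in np.eye(3)])
print(np.max(np.abs(num - mse_grad(w))))  # should be ~1e-10 or smaller
```

A mismatch here would indicate a sign or chain-rule error before any Newton iteration is attempted.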
from numpy import random

class MseHessianLogisticRegression(BinaryLogisticRegression):
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        self.w_ = np.zeros((num_features,1)) + random.rand() # init random weight vector
        # Newton's method on the MSE loss (l1/l2 penalty terms omitted in this implementation)
        for _ in range(self.iters):
            g = self.predict_proba(Xb, add_bias=False).ravel() # sigmoid value for every sample
            gradient = (1/num_samples) * (Xb.T @ ((g-y) * g*(1-g)))
            # second derivative per sample: g(1-g)(-3g^2 + 2g + y(2g-1))
            hessian = (1/num_samples) * (Xb.T @ np.diag(g*(1-g) * (-3*g*g + 2*g + y*(2*g-1))) @ Xb)
            self.w_ -= pinv(hessian) @ gradient.reshape((num_features,1))
%%time
msehlr = MseHessianLogisticRegression(eta=0.1,iterations=10)
msehlr.fit(X_train, y_train)
yhat = msehlr.predict(X_test)
print('Accuracy: ',accuracy_score(y_test,yhat))
Accuracy:  0.2546628407460545
Wall time: 3.36 s
%%time
lr = MultiClassLogisticRegression(eta=0.1,
iterations=10,
C=0.005,
l1=0,
l2=0,
solver=HessianBinaryLogisticRegression
)
lr.fit(X_train,y_train)
#print(lr)
yhat_test = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat_test))
Accuracy of:  0.45911047345767575
Wall time: 4.45 s
Our MLE model has higher accuracy than our MSE model because MSE (mean squared error) is a loss function designed for predicting a continuous target: it measures the average squared difference between predicted and actual values. Using MSE as the loss for logistic regression is a poor fit, because the predictions are probabilities bounded between 0 and 1, and the squared error of a probability is a weak measure of classification quality; it also makes the optimization landscape non-convex. Instead, logistic regression typically uses binary cross-entropy, which measures the difference between the predicted probabilities and the actual binary outcomes and penalizes confident wrong predictions far more heavily than squared error does. Training with this loss is equivalent to maximum likelihood estimation (MLE), which seeks the parameters that maximize the likelihood of the observed data.
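The penalty difference is easy to see with toy numbers: for a true label $y=1$ predicted with probability $g$, cross-entropy $-\log g$ grows without bound as $g \to 0$, while the squared error $(1-g)^2$ saturates at 1:

```python
import numpy as np

g = np.array([0.9, 0.5, 0.1, 0.01])  # predicted P(y=1) for a sample whose true y=1
bce = -np.log(g)                     # binary cross-entropy per sample
mse = (1 - g) ** 2                   # squared error per sample
for gi, b, m in zip(g, bce, mse):
    print(f"g={gi:.2f}  BCE={b:.3f}  MSE={m:.3f}")
```

At g=0.01 the cross-entropy loss is roughly 4.6 while the squared error is still below 1, which is why cross-entropy training pushes much harder against confident mistakes.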
In conclusion, logistic regression is a powerful statistical technique that can be used for customer classification in marketing and customer relationship management. By predicting the probability of a customer belonging to a certain category or segment based on their demographic, behavioral, or transactional characteristics, businesses can identify and target customers who are most likely to respond positively to their marketing efforts and to retain their loyalty over time.
To build an effective logistic regression model for customer classification, it is important to carefully select the outcome and predictor variables, estimate the model using maximum likelihood estimation, and evaluate the model's performance using various metrics. By incorporating regularization techniques such as L1 or L2 regularization, logistic regression can help to reduce overfitting and improve the generalization ability of the model.
Overall, customer classification based on logistic regression is a valuable tool for businesses seeking to improve their marketing and customer retention strategies, and can lead to more efficient and effective use of marketing resources, higher customer satisfaction, and increased revenue.